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Abstract 

We propose a variant of the classical conditional gradient method (CGM) for sparse inverse prob¬ 
lems with differentiable measurement models. Such models arise in many practical problems including 
superresolution, time-series modeling, and matrix completion. Our algorithm combines nonconvex and 
convex optimization techniques: we propose global conditional gradient steps alternating with nonconvex 
local search exploiting the differentiable measurement model. This hybridization gives the theoretical 
global optimality guarantees and stopping conditions of convex optimization along with the performance 
and modeling flexibility associated with nonconvex optimization. Our experiments demonstrate that our 
technique achieves state-of-the-art results in several applications. 


1 Introduction 

A ubiquitous prior in modern statistical signal processing asserts that an observed signal is the noisy mea¬ 
surement of a few weighted sources. In other words, compared to the entire dictionary of possible sources, 
the set of sources actually present is sparse. In the most abstract formulation of this prior, each source is 
chosen from a non-parametric dictionary, but in many cases of practical interest the sources are parame¬ 
terized. Hence, solving the sparse inverse problem amounts to finding a collection of a few parameters and 
weights that adequately explains the observed signal. 

As a concrete example, consider the idealized task of identifying the aircraft that lead to an observed 
radar signal. The sources are the aircraft themselves, and each is parameterized by, perhaps, its position 
and velocity relative to the radar detector. The sparse inverse problem is to recover the number of aircraft 
present, along with each of their parameters. 

Any collection of weighted sources can be represented as a measure on the parameter space: each source 
corresponds to a single point mass at its corresponding parameter value. We will call atomic measures 
supported on very few points sparse measures. When the parameter spaces are infinite—for example the 
set of all velocities and positions of aircraft—the space of sparse measures over such parameters is infinite¬ 
dimensional. This means that optimization problems searching for parsimonious explanations of the observed 
signal must operate over an infinite-dimensional space. 

Many alternative formulations of the sparse inverse problem have been proposed to avoid the infinite¬ 
dimensional optimization required in the sparse measure setup. The most canonical and widely applicable 
approach is to form a discrete grid over the parameter space and restrict the search to measures supported 
on the grid. This restriction produces a finite-dimensional optimization problem [6, 32, 46]. In certain 
special cases, the infinite-dimensional optimization problem over measures can be reduced to a problem of 
moment estimation, and spectral techniques or semidefinite programming can be employed [19, 35, 47, 9]. 
More recently, in light of much of the work on compressed sensing and its generalizations, another proposal 
operates on atomic norms over data [11], opening other algorithmic possibilities. 

While these finite-dimensional formulations are appealing, they all essentially treat the space of sources 
as an unstructured set, ignoring natural structure (such as differentiability) present in many applications. 
All three of these techniques have their individual drawbacks, as well. Gridding only works for very small 
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parameter spaces, and introduces artifacts that often require heuristic post-processing [46]. Moment methods 
have limited applicability, are typically computationally expensive, and, moreover, are sensitive to noise and 
estimates of the number of sources. Finally, atomic norm techniques do not recover the parameters of the 
underlying signal, and as such are more naturally applied to denoising problems. 

In this paper, we argue that all of these issues can be alleviated by returning to the original formulation 
of the estimation problem as an optimization problem over the space of measures. Working with measures 
explicitly exposes the underlying parameter space, which allows us to consider algorithms that make local 
moves within parameter space. We demonstrate that operating on the infinite-dimensional space of measures 
is not only feasible algorithmically, but that the resulting algorithms outperform techniques based on gridding 
or moments on a variety of real-world signal processing tasks. We formalize a general approach to solving 
parametric sparse inverse problems via the conditional gradient method (CGM), also know as the Frank- 
Wolfe algorithm. In §3, we show how to augment the classical CGM with nonconvex local search exploiting 
structure in the parameter space. This hybrid scheme, which we call the alternating descent conditional 
gradient method (ADCG), enjoys both the rapid local convergence of nonconvex programming algorithms 
and the stability and global convergence guarantees associated with convex optimization. The theoretical 
guarantees are detailed in §5, where we bound the convergence rate of our algorithm and also guarantee 
that it can be run with bounded memory. Moreover, in §6 we demonstrate that our approach achieves 
state-of-the-art performance on a diverse set of examples. 

1.1 Mathematical setup 

In this subsection we formalize the sparse inverse problem as an optimization problem over measures and 
discuss a convex heuristic. 

We assume the existence of an underlying collection of objects, called sources. Each source has a non¬ 
negative weight w > 0, and a parameter 0 S 0. An element 9 of the parameter space 0 may describe, for 
instance, the position, orientation, and polarization of a source. The weight may encode the intensity of 
a source, or the distance of a source from the measurement device. Our goal is to recover the number of 
sources present, along with their individual weights and parameters. We do not observe the sources directly, 
but instead are given a single, noisy measurement in 

The measurement model we use is completely determined by a function '0 : 0 —>■ which gives the d- 

dimensional measurement of a single, unit-weight source parameterized by a point in 0. The measurement of 
a lone source is homogeneous of degree one in its weight; that is, a single source with parameter 9 and weight 
w > 0 generates the measurement w'ij}{9) € K^. Finally, we assume that the measurement of a weighted 
collection of sources is additive. In other words, the (noise-free) measurement of a weighted collection of 
sources, {{wi,9i)}^^, is simply 

K 

( 1 - 1 ) 

We refer to the collection {{wi,9i)}fLi as the signal parameters, and the vector ^ 

noise-free measurement. 

We can encode the signal parameters as an atomic measure /r on 0, with mass Wi at point dp. p, = 
As a consequence of the additivity and homogeneity in our measurement model, the total 
measurement of a collection of sources encoded in the measure /r is a linear function ^ oi p: 

^p = j mdm- 

We call <I> the forward operator. For atomic measures of the form p = WiSsi, this clearly agrees with 

(1.1); but it is defined for all measures on 0. 

We now introduce the sparse inverse problem as an optimization problem over measures. Our goal is to 
recover ptme from a measurement 

y = ^’Mtrue + V 
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corrupted by a noise term, v. Recovering the signal parameters without any prior information is, in most 
interesting problems, impossible; the operator <i> is almost never injective. However, in a sparse inverse 
problem we have the prior belief that the number of sources present, while still unknown, is small. That is, 
we can assume that ^true is an atomic measure supported on very few points. 

To make the connection to compressed sensing clear, we refer to such measures as sparse measures. Note 
that while we are using the language of recovery or estimation in this section, the optimization problem we 
introduce is also applicable in cases where these may not be a true measure underlying the measurement 
model. In §2 we give several examples that are not recovery problems. 

We estimate the signal parameters encoded in /xtrue by minimizing a convex loss ^ of the residual between 
y and $/i: 

minimize ^ (<i>/x — y) 

subject to ^ > 0 (1.2) 

|supp(Ai)| < iV, 

where the optimization is over the infinite-dimensional space of measures on 0. For example, when ^ is the 
negative log-likelihood of the noise term v, problem (1.2) corresponds to maximum likelihood estimation 
of /itrue- Here iV is a posited upper bound on the size of the support of the true measure pitrne, which we 
denote by supp(/itrue)- Although here and elsewhere in the paper we explicitly enforce the constraint that 
the measure be nonnegative, all of our discussion and algorithms can be easily extended to the unconstrained 
case. 

While the objective function in (1.2) is convex, the constraint on the support of /x is nonconvex. A 
common heuristic in these situations is to replace the nonconvex constraint with a convex surrogate. The 
standard surrogate for a cardinality constraint on a nonnegative measure is a constraint on the total mass. 
This substitution results in the standard convex approximation to (1.2): 

minimize (. (d>/x — y) 

subject to ^ > 0 (1.3) 

/x(0) < T. 

Here r > 0 is a parameter that controls the total mass of /x and empirically controls the cardinality of 
solutions to (1.3). While problem (1.3) is convex, it is over an infinite-dimensional space, and it is not 
possible to represent an arbitrary measure in a computer. A priori, an approximate solution to (1.3) may 
have arbitrarily large support, though we prove in §5 that we can always find solutions supported on at most 
d-f 1 points. In practice, however, we are interested in approximate solutions of (1.3) supported on far fewer 
than d + 1 points. 

A celebrated example of (1.3) occurs when 0 is the finite set {1,..., fc} and f(r) = In that case, a 

nonnegative measure over 0 can be represented as a vector v in and the forward operator <I> as a matrix 
in The total mass of the nonnegative measure v is then simply — Iklli- this case, (1.3) 

reduces to the nonnegative lasso. 

In this paper, we propose an algorithm for the substantially different case where 0 has some differential 
structure. Our algorithm is based on a variant of the conditional gradient method that takes advantage of 
the differentiable nature of ip, and is guaranteed to produce approximate solutions with bounded support. 

1.2 Relationship to atomic norm problems 

Problems similar to (1.3) have been studied through the lens of atomic 
corresponding to a suitable collection of atoms A C is defined as 

||a;|U = inf < ^ Ca : a; = ^ CaO, Ca > 

a£A 


norms [11]. The atomic norm || • \\j^ 
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The connection to (1.3) becomes clear if we take A = {'>p(0) : 0 € &} U {0}. With this choice of atomic set, 
we have the equality 

IklU = inf |m(0) ■ ^ = y M > o| • 

This equality implies the equivalence (in the sense of optimal objective value) of the infinite-dimensional 
optimization problem (1.3) to the finite-dimensional atomic norm problem: 

minimize £(x — y) , ^ 

^ (1.4) 

subject to ||a;|U ^ 

Much of the literature on sparse inverse problems focuses on problem (1.4), as opposed to the infinite¬ 
dimensional problem (1.3). This focus is due to the fact that (1.4) has algorithmic and theoretical advantages 
over (1.3). First and foremost, (1.4) is finite-dimensional, which means that standard convex optimization 
algorithms may apply. Additionally, the geometry of the atomic norm ball, conv{'0(6*) : 0 G 0}, gives clean 
geometric insight into when the convex heuristic will work [11]. 

With that said, we hold that the infinite-dimensional formulation we study has distinct practical advan¬ 
tages over the atomic norm problem (1.4). In many applications, it is the atomic decomposition that is of 
interest, and not the optimal point a;* of (1.4); reconstructing the optimal /r* for problem (1.3) from a;* can 
be highly nontrivial. For example, when designing radiation therapy, the measure /i* encodes the optimal 
beam plan directly, while the vector a:* = is simply the pattern of radiation that the optimal plan 
produces. For this reason, an algorithm that simply returns the vector a;*, without the underlying atomic 
decomposition, is not always useful in practice. 

Additionally, the measure-theoretic framework exposes the underlying parameter space, which in many 
applications comes with meaningful and useful structure—and is oftentimes more intuitive for practitioners 
than the corresponding atomic norm. Naive interpretation of the finite-dimensional optimization problem 
treats the parameter space as an unstructured set. Keeping the structure of the parameter space in mind 
makes extensions such as ADCG that make local movements in parameter space natural and uniform across 
applications. 


2 Example applications 

Many practical problems can be formulated as instances of (1.3). In this section we briefly outline a few 
examples to motivate our study of this problem. 

Superresolution imaging. The diffraction of light imposes a physical limit on the resolution of optical 
images. The goal of superresolution is to remove the blur induced by diffraction as well as the effects of 
pixelization and noise. For images composed of a collection of point sources of light, this can be posed as a 
sparse inverse problem as follows. The parameters 9i,... ,6 k denote the locations of K point sources (in 
or M^), and wt denotes the intensity, or brightness, of the ith source. The image of the ith source is given 
by Wiip{9i), where '0 is the pixelated point spread function of the imaging apparatus. 

By solving a version of (1.3) it is sometimes possible to localize the point sources better than the diffraction 
limit—even with extreme pixelization. Astronomers use this framework to deconvolve images of stars to 
angular resolution below the Rayleigh limit [37] . In biology this tool has revolutionized imaging of subcellular 
features [16, 41]. A variant of this framework allows imaging through scattering media [31]. In §6.1, we show 
that our algorithm improves upon the current state of the art for localizing point sources in a fluorescence 
microscopy challenge dataset. 

Linear system identification. Linear time-invariant (LTI) dynamical systems are used to model many 
physical systems. Such a model describes the evolution of an output yt S K based on the input ut € M, 
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where t G Z_|_ indexes time. The internal state at time t of the system is parameterized by a vector xt G ffi™, 
and its relationship to the output is described by 


xt+i = Axt + But 
yt = Cxf 


Here C is a fixed matrix, while xq,A, and B are unknown parameters. 

Linear system identification is the task of learning these unknown parameters from input-output data— 
that is a sequence of inputs Ui,... ,ut and the observed sequence of outputs yi,...,yT [43, 19]. We pose 
this task as a sparse inverse problem. Each source is a small LTI system with 2-dimensional state—the 
measurement model gives the output of the small system on the given input. To be concrete, the parameter 
space 0 is given by tuples of the form {xo,r,a, B) where xq and B both lie in the £ca unit ball in K^, r is 
in [0,1], and a is in [0,7r]. The LTI system that each source describes has 

H = C=[l 0]. 

sin(Q;) cos(a) J 

The mapping ip from the parameters {xo,r,a,B) to the output of the corresponding LTI system on input 
ui,... ,ut is differentiable. In terms of the overall LTI system, adding the output of two weighted sources 
corresponds to concatenating the corresponding parameters. 

In §6.1, we show that our algorithm matches the state of the art on two standard system identification 
datasets. 

Matrix completion. The task of matrix completion is to estimate all entries of a large matrix given 
observations of a few entries. Clearly this task is impossible without prior information or assumptions about 
the matrix. If we believe that a low-rank matrix will approximate the truth well, a common heuristic is to 
minimize the squared error subject to a nuclear norm bound. For background in the theory and practice of 
matrix completion under this assumption see [10]. We solve the following optimization problem: 

min \\M{A)-yf. 

\\A\\,<r 

Here M is the masking operator, that is, the linear operator that maps a matrix A G to the vector 

containing its observed entries, and y is the vector of observed entries. We can rephrase this in our notation 
by letting 0 = {(u,v) G U" x K™ : jjujj 2 = llwll 2 = 1}, ipdujv)) = M(uv'^), and £(■) = jj • jp. In §6.1, we 
show that our algorithm achieves state of the art results on the Netflix Challenge, a standard benchmark in 
matrix completion. 

Bayesian experimental design. In experimental design we seek to estimate a vector a; G from 
measurements of the form 

-f e,. 

Here / : 0 —is a known differentiable feature function and are independent noise terms. We want to 
choose 9i,... ,9k to minimize our uncertainty about x — if each measurement requires a costly experiment, 
this corresponds to getting the most information from a fixed number of experiments. For background, 
see [36]. 

In general, this task in intractable. However, if we assume are independently distributed as standard 
normals and x comes from a standard normal prior we can analytically derive the posterior distribution of 
X given yi,... ,yk, as the full joint distribution oi x,yi,... ,ym is normal. 

One notion of how much information yi,..., y^ carry about x is the entropy of the posterior distribution 
of X given the measurements. We can then choose 9i,... ,9k to minimize the entropy of the posterior, which 
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is equivalent to minimizing the (log) volume of an uncertainty ellipsoid. With this setup, the posterior 
entropy is (up to additive constants and a positive multiplicative factor) simply 

- log det ^ /(6l,)/(6»i)^ 

To put this in our framework, we can take iIj{9) = f{O)f{0)^, 2 / = 0 and £{M) = — log det(/ + M)~^. We 
relax the requirement to choose exactly k measurement parameters and instead search for a sparse measure 
with bounded total mass, giving us an instance of (1.3). 

Fitting mixture models to data. Given a parametric distribution P{x\6) we consider the task of recov¬ 
ering the components of a mixture model from i.i.d. samples. For background see [29]. To be more precise, 
we are given data {xi,..., Xd} sampled i.i.d. from a distribution of the form P{x) = P{x\9)tt{9). The 

task is to recover the mixing distribution tt. If we assume tt is sparse, we can phrase this as a sparse inverse 
problem. To do so, we choose tpW = ^ common choice for £ is the (negative) log-likelihood 

of the data: i.e., y = 0, £{p) = — logp^. The obvious constraint is f d7r(9) < 1. 

Design of numerical quadrature rules. In many numerical computing applications we require fast 
procedures to approximate integration against a fixed measure. One way to do this is use a quadrature rule: 

k 

f{9)dp{9) ~^w^f{xi). 

i=l 

The quadrature rule, given by iCi € K and Xi G 0, is chosen so that the above approximation holds for 
functions / in a certain function class. The pairs (xi,Wi) are known as quadrature nodes. In practice, we 
want quadrature rules with very few nodes to speed evaluation of the rule. 

Often we don’t have an a priori description of the function class from which / is chosen, but we might 
have a hnite number of examples of functions in the class, /i,..., fd, along with their integrals against p, 
yi,... ,yd- In other words, we know that 



J M0)dp{9) = yi. 


A reasonable quadrature rule should approximate the integrals of the known fi well. 

We can phrase this task as a sparse inverse problem where each source is a single quadrature node. In 
our notation, tp{9) = {fi{9),... ,fd{9)). Assuming each function fi is differentiable, ip is differentiable. A 
common choose of £ for this application is simply the squared loss. Note that in this application there is no 
need to constraint the weights to be positive. 

Neural spike identification. In this example we consider the voltage v recorded by an extracellular 
electrode implanted in the vicinity of a population of neurons. Suppose that this population of neurons 
contains K types of neurons, and that when a neuron of type k fires at time t S K, an action potential of 
the form '0(t, k) is recorded. Here : K x {1,..., K} —)■ is a vector of voltage samples. If we denote the 
parameters of the ith neuron by 9i = {ti,ki), then the total voltage v G can be modeled as a superposition 
of these action potentials: 

K 

V=y^^w^ip{9i). 

i=l 

Here the weights Wi > 0 can encode the distance between the fth neuron and the electrode. The sparse 
inverse problem in this application is to recover the parameters ,..., 9x and weights wi,, wk from the 
voltage signal v. For background see [15]. 
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Designing radiation therapy. External radiation therapy is a common treatment for cancer in which 
several beams of radiation are fired at the patient to irradiate tumors. The collection of beam parameters 
(their intensities, positions, and angles) is called the treatment plan, and is chosen to minimize an objective 
function specified by an oncologist. The objective usually rewards giving large doses of radiation to tumors, 
and low dosages to surrounding healthy tissue and vital organs. Plans with few beams are desired as 
repositioning the emitter takes time—increasing the cost of the procedure and the likelihood that the patient 
moves enough to invalidate the plan. 

A beam fired with intensity w > 0 and parameters 6 delivers a radiation dosage G Here the 

output is interpreted as the radiation delivered to each of d voxels in the body of a patient. The radiation 
dosage from beams with parameters 6i,... ,9 k and intensities Wi,..., wk add linearly, and the objective 
function is convex. For background see [23]. 


3 Conditional gradient method 

In this section we present our main algorithmic development. We begin with a review of the classical 
conditional gradient method (CGM) for finite-dimensional convex programs. We then translate the classical 
CGM for the sparse inverse problem (1.3). In particular, we augment this algorithm with an aggressive local 
search subroutine that significantly improves the practical performance of the CGM. 

The classical GGM solves the following optimization problem: 

minimize/(a;), (3-1) 

x^C 

where C is a closed, bounded, and convex set and / is a differentiable convex function. 

CGM proceeds by iteratively solving linearized versions of (3.1). At iteration k, we form a linear approx¬ 
imation to the function / at the current point Xk- We then minimize the linearization over the feasible set to 
get a potential solution Sk. This step can be interpreted as a restricted steepest descent: the linear functional 
represented by the gradient is simply the directional derivative. As Sk minimizes a simple approximation of 
/ that degrades with distance from Xk we take a convex combination of Sk and Xk as the next iterate. We 
summarize this method in Algorithm 1. 


Algorithm 1 Gonditional gradient method (GGM) 

For k = 1,... fcmax 

1. Linearize: fkis) ^ f{xk) + f{xk), s - Xk). 

2. Minimize: Sk 5 argmin^g^^ fk{s). 

3. Tentative update: Xk+i ^ 

4. Final update: Choose Xk+i such that f{xk+i) < f(xk+i). 


It is important to note that minimizing fk{s) over the feasible set C in step 2 may be quite difficult and 
requires an application-specific subroutine. 

One of the more remarkable features of the CGM is step 4. While the algorithm converges using the 
tentative update in step 4, all of the convergence guarantees of the algorithm are preserved if one replaces 
Xk+i with any feasible Xk+i that achieves a smaller value of the objective. There are thus many possible 
choices for the final update in step 4, and the empirical behavior of the algorithm can be quite different for 
different choices. One common modification is to do a line-search: 

Xk+i = argmin f{x). 

rEGconv(a:fc ,Sfc) 
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We use conv to denote the convex hull—in this last example, a line segment. Another variant, the fully- 
corrective conditional gradient method, chooses 


Xk+i = argmin f{x). 

x^conv{xk ,si ) 

In the next section, we propose a natural choice for this step in the case of measures that uses local search 
to speed-up the convergence of the CGM. 

One appealing aspect of the CGM is that it is very simple to compute a lower bound on the optimal 
value /* as the algorithm runs. By convexity of /, we have 

/(s) > fixk) + {s- Xfc, V/(a;fc)) = fk{s) 

for any s € C. Minimizing both sides over s gives us the elementary bound 

/* > fk{sk)- 

The right hand side of this inequality is readily computed after step (2). 

3.1 CGM for sparse inverse problems 

In this section we translate the classical CGM for the sparse inverse problem (1.3). We give two versions— 
first a direct translation of the fully corrective variant and then our improved algorithm for differentiable 
measurement models. To make it clear that we operate over the space of measures we change notation and 
denote the iterate by instead of Xk- The most obvious challenge is that we cannot represent a general 
measure on a computer unless it is hnitely-supported. We will see however that the steps of CGM can in fact 
be carried out on a computer in this context. Moreover we later prove that the iterates can be represented 
with bounded memory. 

Before we describe the algorithm in detail, we first explain how to linearize the objective and minimize 
the linearization. In the space of measures, linearization is most easily understood as a directional derivative: 
in the finite dimensional case, we always have that 

(V/(/r/c), s) = Dsfink) ■■= Ihn ^ 

In our formulation (1.3), f{p) = £($^fc — y). If we dehne the residual error as rk = ^yik — y, we can 
compute the directional derivative of our particular choice of / at fj,k as 

D./(«) = Inn ^ C(r> + tt.) - Ur,) ^ ^ 

40 ^ 40 ^ ^ ^ ^ 

Here, the inner product on the right hand side of the equation is the standard inner product in 

The second step of the CGM minimizes the linearized objective over the constraint set. In other words, we 
minimize (Vf'(rfc), $s) over a candidate nonnegative measure s with total mass bounded by r. Interchanging 
the integral (in <I>) with the inner product, and defining F{0) := (Vt'(rfe), ^/>(0)), we need to solve the 
optimization problem: 

minimize [ F(6)ds(6). (3.3) 

s>0, s(0)<r y 

The optimal solution of (3.3) is the point-mass rdg^, where 0* G argminF(0) (unless F(d) is positive 
everywhere in which case the optimal solution is the 0 measure). This means that at each step of the GGM 
we need only add a single point to the support of our approximate solution fik ■ Moreover we prove that our 
algorithm produces iterates with support on at most d -I- 1 points (see Theorem 5.1). 

We now describe the fully-corrective variant of the CGM for sparse inverse problems (Algorithm 2). The 
state of the algorithm at iteration fc is a nonnegative atomic measure fj,k supported on a finite set Fk with 






Figure 1: The three plots above show the first three iterates of the fully corrective CGM in a simulated 
superresolution imaging problem with two point sources of light. The locations of the true point sources are 
indicated by green stars, and the greyscale background shows the pixelated image. The elements of Sk for 
k = 1,2,3 are displayed by red dots. 


mass /rfe({d}) on points 9 G Sk- The algorithm alternates between selecting a source to add to the support, 
and tuning the weights to lower the current cost. This tuning step (Step 4) is a finite-dimensional convex 
optimization problem that we can solve with an off-the-shelf algorithm. 


Algorithm 2 Gonditional gradient method for measures (CGM-M) 

For fc = I : fcmax 


I. Gompute gradient of loss: 

gk = V€($^fc_i - y). 

2. Compute next source: 

0k G aTgmin{gk,ip{9)). 
oee 

3. Update support: 

Sk G- Sk-i U {9k}- 

4. Compute weights: 

fik^ argmin ^ (EeeS;, - 2/) ■ 

At>0, 

5. Prune support: 

Sk G- SUpp(/ife). 


While we can simply run for a fixed number of iterations, we may stop early using the standard CGM 
bound. With a tolerance parameter e > 0, we terminate when the conditional gradient bound assures us 
that we are at most e-suboptimal. In particular, we terminate when 

- T{il){9k),gk)+ < £• (3.4) 

Unfortunately, CGM-M does not perform well in practice. Not only does it converge very slowly, but the 
solution it finds is often supported on an undesirably large set. As illustrated in Figure 1, the performance 
of CGM-M is limited by the fact that it can only change the support of the measure by adding and removing 
points; it cannot smoothly move Sk within 0. Figure 1 shows CGM-M applied to an image of two closely 
separated sources. The first source 9i is placed in a central position overlapping both true sources. In 
subsequent iterations sources are placed too far to the right and left, away from the true sources. To move 
the support of the candidate measure requires GGM-M to repeatedly add and remove sources; it is clear that 
the ability to move the support smoothly within the parameter space would resolve this issue immediately. 

In practice, we can speed up convergence and find significantly sparser solutions by allowing the support 
to move continuously within 0. The following algorithm, which we call the alternating descent conditional 
gradient method (ADGG), exploits the differentiability of ip to locally improve the support at each iteration. 
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Algorithm 3 Alternating descent conditional gradient method (ADCG) 

For k = 1 : fcmax 

1. Compute gradient of loss: 

gk V£(^/j,k-i - y)- 

2. Compute next source: 

Choose 9k G argmin('0(d),5fc). 

6»ee 

3. Update support: 

Sk t— Sk-i U {9k}- 

4. Coordinate descent on nonconvex objective: 

Repeat: 

(a) Compute weights: 

yk^ argmin ^ (LeeSt “ 2/) 

/i>0, ^(Sfc)<T 

m(SD=o 

(b) Prune support: 

Sk = support (/Tfc). 

(c) Locally improve support: 

Sk — local_descent((0, ^fc({0})) : 9 G Sk)- 


Here locaLdescent is a subroutine that takes a measure /i^ with atomic representation (0i, rci),..., {0^, Wm) 
and attempts to use gradient information to reduce the function 

m 

(6>i,..., 1 -^ - y), 

i=l 

holding the weights fixed. 

When the number of sources is held fixed, the optimization problem 

C k 

- y 

i=i 

^ 0 (3.5) 

G 0 

Wj < T 
i 

is nonconvex. Step 4 is then block coordinate descent over Wi and 9i. The algorithm as a whole can be 
interpreted as alternating between performing descent on the convex (but infinite-dimensional) problem (1.3) 
in step 2 and descent over the finite-dimensional (but nonconvex) problem (3.5) in step 4. The bound (3.4) 
remains valid and can be used as a termination condition. 

As we have previously discussed, this nonconvex local search does not change the convergence guarantees 
of the CGM whatsoever. We will show in Section 5 that this is an immediate consequence of the existing 
theory on the CGM. However, as we will show in §6, the inclusion of this local search dramatically improves 
the performance of the CGM. 

3.2 Interface and implementation 

Roughly speaking, running ADCG on a concrete instance of (1.3) requires subroutines for two operations. 
We need algorithms to approximately compute: 

(a) ip{9) and ^V’(^) for 0 G 0. 

(b) argmin(^/>(0), u) for arbitrary vectors v G 
see 


minimize 
subject to 
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Computing (a) is usually straightforward in applications with differentiable measurement models. Comput¬ 
ing (b) is not easy in general. However, there are many applications of interest where (b) is tractable. For 
example, if the parameter space 0 is low-dimensional, then the ability to compute (a) is sufficient to ap¬ 
proximately compute (b) : we can simply grid the parameter space and begin local search using the gradient 
of the function 9 i—)■ {ijj{9),v). Note that because of the local improvement step, ADCG works well even 
without exact minimization of (b). We prove this fact about inexact minimization in Section 5. 

If the parameter space is higher dimensional, however, the feasibility of computing (b) will depend on the 
specific application. One example of particular interest that has been studied in the context of the CGM is 
matrix completion [25, 38, 20, 49]. In this case, the (b) step reduces to computing the leading singular vectors 
of a sparse matrix. We will show that adding local improvement to the CGM accelerates its convergence on 
matrix completion in the experiments. 

We also note that in the special case of linear system identification, 0 is 6 dimensional, which is just 
large enough such that gridding is not feasible. In this case, we show that we can reduce the 6-dimensional 
optimization problem to a 2-dimensional problem and then again resort to gridding. We expect that in many 
cases of interest, such specialized solvers can be applied to solve the selection problem (b). 

4 Related work 

There has recently been a renewed interest in the conditional gradient method as a general purpose solver for 
constrained inverse problems [24, 20]. These methods are simpler to implement than the projected or proximal 
gradient methods which require solving a quadratic rather than linear optimization over the constraint set. 

The idea of augmenting the classic conditional gradient method with improvement steps is not unique 
to our work. Indeed, it is well known that any modification of the iterate that decreases the objective 
function will not hurt theoretical convergence rates [24]. Moreover, Rao et al [38] have proposed a version of 
the conditional gradient method, called GoGENT, for atomic norm problems that take advantage of many 
common structures that arise in inverse problems. The reduction described in our theoretical analysis makes 
it clear that our algorithm can be seen as an instance of GoGENT specialized to the case of measures and 
differentiable measurement models. 

The most similar proposals to ADGG come from the special case of matrix completion or nuclear-norm 
regularized problems. Several papers [49, 30, 20, 25] have proposed algorithms based on combinations of 
rank-one updates and local nonconvex optimization inspired by the well-known heuristic of [8]. While our 
proposal is significantly more general, ADCG essentially recovers these algorithms in the special case of 
nuclear-norm problems. 

We note that in the context of inverse problems, there are a variety of algorithms proposed to solve the 
general infinite-dimensional problem (1.3). Tang et al [46] prove that this problem can be approximately 
solved by gridding the parameter space and solving the resulting finite dimensional problem. However, 
these gridding approaches are not tractable for problems with parameter spaces even of relatively modest 
dimension. Moreover, even when gridding is tractable, the solutions obtained are often supported on very 
large sets and heuristic post-processing is required to achieve reasonable performance in practice [46]. In 
spite of these limitations, gridding is the state of the art in many application areas including computational 
neuroscience [15], superresolution fluorescence microscopy [50], radar [5, 21], remote sensing [17], compressive 
sensing [4, 12, 13], and polynomial interpolation [39]. 

There have also been a handful of papers that attempt to tackle the infinite-dimensional problem without 
gridding. For the special case where £{■) = ]] -U^, Bredies and Pikkarainen [7] propose an algorithm to solve the 
Tikhonov-regularized version of problem (1.3) that is very similar to Algorithm 3. They propose performing 
a conditional gradient step to update the support of the measure, followed by soft-thresholding to update 
the weights. Finally, with the weights of the measure hxed they perform discretized gradient flow over the 
locations of the point-masses. However, they do not solve the finite-dimensional convex problem at every 
iteration, which means there is no guarantee that their algorithm has bounded memory requirements. For 
the same reason, they are limited to one pass of gradient descent in the nonconvex phase of the algorithm. 
In §6 we show that this limitation has serious performance implications in practice. 
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5 Theoretical guarantees 

In this section we present a few theoretical results. The first guarantees that we can run our algorithm 
with bounded memory. The second result guarantees that the algorithm converges to an optimal point and 
bounds the worst-case rate of convergence. 

5.1 Bounded memory 

As the CGM for measures adds one point to the support of the iterate per iteration, we know that the 
cardinality of the support of /i^ is bounded by k. For large fc, then, could have large support. The 
following theorem guarantees that we can run our algorithm with bounded memory and in fact we need only 
store at most d + \ points, where d is the dimension of the measurements. 

Theorem 5.1. ADCG may be implemented to generate iterates with cardinality of support uniformly bounded 
by d+1. 

Proof. Lemma (5.2) allows us to conclude that the fully-corrective step ensures that the support of the 
measure remains bounded by d -I- 1 for all iterations. □ 

Lemma 5.2. The finite-dimensional problem 

minimize WiiflOA — y) (5.1) 

has an optimal solution w* with at most d 1 nonzeros. 

Proof. Let u* be any optimal solution to (5.1). As u* is feasible, we have that 

u = ^ e rconv({'0(di) : i = 1,..., to} U {0}). 

i 

In other words, ^ lies in the convex hull of a set in Caratheodory’s theorem immediately tells us that 
^ can be represented as a convex combination of at most d -I- 1 points from {ifiOi) : i = 1,..., to}. That is, 
there exists a ic* with at most d -I- 1 nonzeros such that 

m 

= V. 

i=l 


This implies that w* is also optimal for (5.1). □ 

Note that in order to find w*, we need to either use a simplex-type algorithm to solve (5.1) or explore 
the optimal set using the random ray-shooting procedure as described in [45]. 

5.2 Convergence analysis 

We now analyze the worst-case convergence rate for ADCG applied to (1.3). Theorem 5.3 below guarantees 
that ADCG achieves accuracy 5 in 0(y) iterations. 

The theorem applies even when the linear minimization step is performed approximately. That is, we 
allow 0k to be chosen such that 


{'^{0k),gk) < mm(V’(6»), gk) + (5.2) 

for some C ^ 0- When inequality (5.2) holds, we say that the linear minimization problem in iteration k is 
solved to precision f. 
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The analysis relies on a finite-dimensional optimization problem equivalent to (1.3). Let A = '■ 

0 € 0}U {0}. Readers familiar with the literature on atomic norms [11] will recognize the finite-dimensional 
problem we consider as an atomic norm problem: 

minimize 

subject to 

The connection to (1.3) becomes clear if we note that rconv^ = : /i > 0,fi{9) < r}. Any feasible 

measure for (1.3) gives us a feasible point for (5.3). Likewise, any feasible x for (5.3) can be decomposed 
as a feasible measure /r for (1.3). Furthermore, these equivalences preserve the objective value. 

Before we state the theorem precisely, we introduce some notation. Let £* = £($/.t* — y) denote the 
optimal value of (1.3) —the discussion above implies that £* is also the optimal value of (5.3). Following 
Jaggi in [24], we define the curvature parameter Cf^s of a function / on a set S. Intuitively, Cf^s measures 
the maximum divergence between / and its first-order approximations, f{z;x) = f(x) -\- {z — x,Vf{x)): 

Cf,s= sup ^{f{z)- f{z;x)). 

x,sGS 1 
7e[o,i] 

2=X+7(s —x) 

Theorem 5.3. Let C he the curvature parameter of the function f{x) = t(x — y) on the set rconvA. If each 
linear minimization subproblem is solved to precision Cf, the iterates pLi,y 2 ,--- of ADCG applied to (1.3) 
satisfy 

2C 

£($/rfc - y) - 4 < + 0- 

Proof. We first show that the points ^^ 2 , ■ ■ ■ are iterates of the standard CGM (with a particular choice 
of the final update step) applied to the finite-dimensional problem (5.3). We then appeal to [24] to complete 
the proof. 

Suppose that = Xk- We show that the linearization step in both algorithms produces the same 
result (up to the equivalence mentioned earlier). Let Ok+i = arg V£($/ife — y)) be the output of 

step 2 of ADCG. Let Sk be the output of the linear minimization step of the standard CGM applied to (5.3) 
starting at a;*,. Then 

Sk = argmin (s, V^(a;fc - ?/)). 

sGTconVw 4 

Recalling that convA = | /r > 0, /i(0) < 1}, we must have Sk = Til;{0k)- Therefore, the linear 

minimization steps of the standard CGM and ADCG coincide. 

We now need to show that the nonconvex coordinate descent step in ADCG is a valid final update step 
for the standard CGM applied to (5.3). This is clear as the coordinate descent step does at least as well as 
the fully-corrective step. We can hence appeal to the results of Jaggi [24] that bound the convergence rate 
of the standard GGM on finite-dimensional problems to finish the proof. □ 

6 Numerical results 

In this section we apply ADCG to three of the examples in §2: superresolution fluorescence microscopy, 
matrix completion, and system identification. We have made a simple implementation of ADCG publicly 
available on github: 

https://github.com/nboyd/SparseInverseProblems.jl. 

This allows the interested reader to follow along with these examples, and, hopefully, to apply ADCG to 
other instances of (1.3). 

For each example we briefly describe how we implement the required subroutines for ADCG, though again 
the interested reader may want to consult our code for the full picture. We then describe how ADCG compares 


i{x - y) 
x S rconvM. 


(5.3) 
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to prior art. Finally, we show how ADCG improves on the standard fully-corrective conditional gradient 
method for measures (CGM-M) and a variant of the gradient flow algorithm (GF) proposed in [7]. While the 
gradient flow algorithm proposed in [7] does not solve the finite-dimensional convex problem at each step, our 
version of GF does. We feel that this is a fair comparison: intuitively, fully solving the convex problem can 
only improve the performance of the GF algorithm. All three experiments require a subroutine to solve the 
finite-dimensional convex optimization problem over the weights. For this we use a simple implementation 
of a primal-dual interior point method, which we include in our code package. 

For each experiment we select the parameter r by inspection. For matrix completion and linear system 
ID this means using a validation set. For single molecule imaging each image requires a different value of 
T. For this problem, we run ADCG with a large value of r and stop when the decrease in the objective 
function gained by the addition of a source falls below a threshold. This heuristic can be viewed as post-hoc 
selection of r and the stopping tolerance e, or as a stagewise algorithm [48]. 

The experiments are run on a standard c4.8xlarge EG2 instance. Our naive implementations are meant 
to demonstrate that ADGG is easy to implement in practice and finds high-quality solutions to (1.3). For 
this reason we do not include detailed timing information. 

6.1 Superresolution fluorescence microscopy 

We analyze data from the Single Molecule Localization Microscopy (SMLM) challenge [42, 18]. Fluorescence 
microscopy is an imaging technique used in the biological sciences to study subcellular structures in vivo. 
The task is to recover the 2D positions of a collection of fluorescent proteins from images taken through an 
optical microscope. 

Here we compare the performance of our ADGG to the gridding approach of Tang et al ]46], two algorithms 
from the microscopy community (quickPALM and center of Gaussians), and also CGM and the gradient flow 
(GF) algorithm proposed by [7]. The gridding approach approximately solves the continuous optimization 
problem (1.3) by discretizing the space 0 into a finite grid of candidate point source locations and running an 
£i-regularized regression. In practice there is typically a small cluster of nonzero weights in the neighborhood 
of each true point source. With a fine grid, each of these clusters contains many nonzero weights, yielding 
many false positives. 

To remove these false positives, Tang et al propose a heuristic post-processing step that involves taking 
the center of mass of each cluster. This post-processing step is hard to understand theoretically, and does 
not perform well with a high-density of fluorophores. 

6.1.1 Implementation details 

For this application, the minimization required in step 2 of ADCG is not difficult: the parameter space is 
two-dimensional. Coarse gridding followed by a local optimization method works well in theory and practice. 

For locaFdescent we use a standard constrained gradient method provided by the NLopt library [27]. 

6.1.2 Evaluation 

We measure localization accuracy by computing the Fi score, the harmonic mean of precision and recall, 
at varying radii. Computing the precision and recall involves hrst matching estimated point sources to 
true point sources—a difficult task. Fortunately, the SMLM challenge website [18] provides a stand-alone 
application that we use to compute the Fi score. 

We use a dataset of 12000 images that overlay to form simulated microtubules (see Figure 2) available 
online at the SMLM challenge website [18]. There are 81049 point sources in total, roughly evenly distributed 
across the images. Figure 2a shows a typical image. Each image covers an area 6400nm across, meaning 
each pixel is roughly lOOnm by lOOnm. 

Figure 3 compares the performance of ADCG, gridding, quickPALM, and center of Gaussians (GoG) on 
this dataset. We match the performance of the gridding algorithm from [46], and signihcantly beat both 
quickPALM and GoG. Our algorithm analyses all images in well under an hour—significantly faster than the 
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Figure 2: The long sequence dataset contains 12000 images similar to (a). The recovered locations for all 
the images are displayed in (b). 


gridding approach of [46]. Note that the gridding algorithm of [46] does not work without a post-processing 
step. 

6.2 Matrix completion 

As described in §2, matrix completion is the task of estimating an approximately low rank matrix from some 
of its entries. We test our proposed algorithm on the Netflix Prize dataset, a standard benchmark for matrix 
completion algorithms. 

6.2.1 Implementation details 

Although the parameter space for this example is high-dimensional we can still compute the steepest descent 
step over the space of measures. The optimization problem we need to solve is: 

ininimize (■(/'(a, 6), i/) = (M(a6^),i/) = {ab^, M*(v)). 

I|a||2 = ll&ll2=l 

In other words, we need to find the unit norm, rank one matrix with highest inner product with the 
matrix M*i/. The solution to this problem is given by the top singular vectors of M*^. Computing the top 
singular vectors using a Lanczos method is relatively easy as the matrix M*v is extremely sparse. 

Our implementation of locaLdescent takes a single step of gradient descent with line-search. 

6.2.2 Evaluation 

Our algorithm matches the state of the art for nuclear norm based approaches on the Netflix Prize dataset. 
Briefly, the task here is to predict the ratings 480,189 Netflix users give to a subset of 17,770 movies. One 
approach has been to phrase this as a matrix completion problem. That is, to try to complete the 480,189 
by 17,770 matrix of ratings from the observed entires. Following [40] we subtract the mean training score 
from all movies and truncate the predictions of our model to lie between 1 and 5. 

Figure 4 shows root-mean-square error (RMSE) of our algorithm and other variants of the CGM on 
the Netflix probe set. Again, ADCG outperforms all other CGM variants. Our algorithm takes over 7 
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Tolerance radius in nm 


Figure 3: Performance on bundled tubes: long sequence. F-scores at various radii for 6 algorithms. 
For reference, each image is 6400nm across, meaning each pixel has a width of lOOnm. ADCG outperforms 
all competing methods on this dataset. 


hours to achieve the best RMSE—this could be improved with a more sophisticated implementation, or 
parallelization. 


6.3 System identification 

In this section we apply our algorithms to identifying two single-input single-output systems from the DalSy 
collection [14]: the flexible robot arm dataset (ID 96.009) and the hairdryer dataset (ID 96.006). 

6.3.1 Implementation details 

While the parameter space is 6-dimensional, which effectively precludes gridding, we can efficiently solve 
the minimization problem in step (2) of the ADCG. To do this, we grid only over r and a: the output is 
linear in the remaining parameters (B and xq) allowing us to analytically solve for the optimal B and xq as 
a function of r, a. 

For locaLdescent we again use a standard box-constrained gradient method provided by the NLopt 
library [27]. 


6.3.2 Evaluation 


Both datasets were generated by driving the system with a specific input and recording the output. The 
total number of samples is 1000 in both cases. Following [43] we identify the system using the first 300 time 
points and we evaluate performance by running the identified system forward for the remaining time points 
and compare our predictions to the ground truth. 

We evaluate our predictions ?/pred using the score defined in [19]. The score is given by 


score = 100 



||?/pred I/II2 \ 

II ymean 1/1| 2 / 


( 6 . 1 ) 


where j/mean is the mean of the test set y. 

Figure 5 shows the score versus the number of sources as we run our algorithm. For reference we display 
with horizontal lines the results of [19]. ADCG matches the performance of [19] and exceeds that of all 
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Figure 4: RMSE on Netflix challenge dataset. ADCG significantly outperforms CGM-M. 
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other CGM variants. Our simple implementation takes about an hour, which compares very poorly with the 
spectral methods in [19] which complete in under a minute. 


7 Conclusions and future work 

As demonstrated in the numerical experiments of §6, ADCG achieves state of the art performance in 
superresolution fluorescence microscopy, matrix completion, and system identification, without the need 
for heuristic post-processing steps. The addition of the nonconvex local search step locaLdescent signifi¬ 
cantly improves performance relative to the standard conditional gradient algorithm in all of the applications 
investigated. In some sense, we can understand ADCG as a method to rigorously control local search. One 
could just start with a model expansion (1.1) and perform nonconvex local search. However, this fares far 
worse than ADCG in practice and has no theoretical guarantees. The ADCG framework provides a clean 
way to generate a globally convergent algorithm that is practically efficient. Understanding this coupling 
between local search heuristics and convex optimization leads our brief brief discussion of future work. 

Tighten convergence analysis for ADCG. The conditional gradient method is a robust technique, 
and adding our auxiliary local search step does not change its convergence rate. However, in practice, the 
difference between the ordinary conditional gradient method, the fully corrective variants, and ADGG are 
striking. In many of our experiments, ADCG outperforms the other variants by appreciable margins. Yet, 
all of these algorithms share the same upper bound on their convergence rate. A very interesting direction of 
future work would be to investigate if the bounds for ADCG can be tightened at all to be more predictive of 
practical performance. There may be connections between our algorithm and other alternating minimization 
techniques popular in matrix completion [28, 26], sparse coding [1, 2], and phase retrieval [33], and perhaps 
the techniques from this area could be applied to our setting of sparse inverse problems. 

Relaxation to clustering algorithms. Another possible connection that could be worth exploring is the 
connection between the CGM and clustering algorithms like k-means. Theoretical bounds have been devised 
for initialization schemes for clustering algorithms that resemble the first step of CGM [3, 34]. In these 
methods, k-means is initialized by randomly seeking the points that are farthest from the current centers. 
This is akin to the first step of CGM which seeks the model parameters that best describe the residual error. 
Once a good seeding is acquired, the standard Lloyd iteration for k-means can be shown to converge to the 
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Figure 5: Performance on DalSy datasets. ADGG outperforms other GGM variants and matches the 
nuclear-norm based technique of [43]. 


global optimal solution [34]. It is possible these analyses could be generalized to analyze our version of CGM 
or inspire new variants of the GGM. 

Connections to cutting plane methods and semi-infinite programs. The standard Lagrangian dual 
of (1.3) is a semi-infinite program (SIP), namely an optimization problem with a finite dimensional decision 
variable but an infinite collection of constraints [22, 44]. One of the most popular algorithmic techniques for 
SIP is the cutting plane method, and these methods qualitatively act very much like the CGM. Exploring 
this connection in detail could generate variants of cutting plane methods suited for continuous constraint 
spaces. Such algorithms could be valuable tools for solving semi-infinite programs that arise in contexts 
disjoint from sparse inverse problems. 

Other applications. We believe that our techniques are broadly applicable to other sparse inverse prob¬ 
lems, and hope that future work will explore the usefulness of ADCG in areas unexplored in this paper. To 
facilitate the application of ADCG to more problems, such as those described in §2, we have made our code 
publicly available on GitHub. As described in §3, implementing ADCG for a new application essentially 
requires only two user-specified subroutines: one routine that evaluates the the measurement model and its 
derivatives at a specified set of weights and model parameters, and one that approximately solves the linear 
minimization in step 2 of ADCG. We aim to investigate several additional applications in the near future to 
test the breadth of the efficacy of ADCG. 
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